Attention to Detail

Marketing Analytics Process

Quarto

Quarto is a powerful typesetting tool. Instead of writing the report or presentation in Word, PowerPoint, or a Google Doc; running code in an R script; saving output and visualizations; and pasting them in the report or the presentation – we can do it all in Quarto.

  • Markdown: Simple typesetting syntax.
  • Quarto: Allows us to include code cells (in R, Python, SQL, etc.) and output as part of the document.
  • Pandoc: Magic that outputs the Quarto document as nearly anything.

YAML

The header at the top of a Quarto document coded in YAML (i.e., Yet Another Markup Language).

---
title: "Exercise 5"
author: "Marc Dotson"
format: docx
---

While Quarto can produce HTML documents, PowerPoint slides, slides (like these), GitHub documents, blogs, webpages, etc. – unless specified otherwise, all subsequent assignments will be submitted as Word documents rendered from a Quarto document.

Quarto Basics

  • Use ## headings and ### sub-headings to clearly identify sections.
  • Code cells begin with ‘‘‘{r} and end with ‘‘‘ and should include comments but not text.
  • Use a separate code cell for every question so that output and text can be provided in-line.
  • Produce bullet points using -s.
  • Identify functions and data with `s, italics with *s, and bold with **s.
  • Quarto also has a visual editor.
  • There are options, including running all previous code or a single code cell.
  • Pay attention to required white space.
  • Render the document to produce the specified output.
  • Always read the output.

library(tidyverse)
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr     1.1.4     ✔ readr     2.1.5
## ✔ forcats   1.0.0     ✔ stringr   1.5.1
## ✔ ggplot2   3.5.1     ✔ tibble    3.2.1
## ✔ lubridate 1.9.4     ✔ tidyr     1.3.1
## ✔ purrr     1.0.2     
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors

crm_data <- read_csv("customer_data.csv") |> 
  left_join(read_csv("store_transactions.csv"), join_by(customer_id))
## Rows: 10531 Columns: 13
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (8): gender, married, college_degree, region, state, review_time, review...
## dbl (5): customer_id, birth_year, income, credit, star_rating
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
## Rows: 10531 Columns: 169
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## dbl (169): customer_id, jan_2005, feb_2005, mar_2005, apr_2005, may_2005, ju...
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

Data Structure and Summarizing

How can we summarize the transactions for 2018 by region?

crm_data |> 
  select(region, customer_id, contains("2018"))
## # A tibble: 10,531 × 14
##    region    customer_id jan_2018 feb_2018 mar_2018 apr_2018 may_2018 jun_2018
##    <chr>           <dbl>    <dbl>    <dbl>    <dbl>    <dbl>    <dbl>    <dbl>
##  1 South            1001        1        0        0        0        0        4
##  2 West             1002        0        0        0        0        0        0
##  3 South            1003        4        0        0        0        0        0
##  4 Midwest          1004        0        0        0        2        0        0
##  5 West             1005        0        1        0        0        0        0
##  6 Midwest          1006        0        0        0        0        0        0
##  7 Midwest          1007        0        0        0        4        0        0
##  8 South            1008        0        0        0        0        0        0
##  9 West             1009        0        0        0        0        1        0
## 10 Northeast        1010        0        0        0        0        0        0
## # ℹ 10,521 more rows
## # ℹ 6 more variables: jul_2018 <dbl>, aug_2018 <dbl>, sep_2018 <dbl>,
## #   oct_2018 <dbl>, nov_2018 <dbl>, dec_2018 <dbl>

Tidy Data

Tidy data is defined as follows:

  • Each observation has its own row.
  • Each variable has its own column.
  • Each value has its own cell.

This may seem obvious or simple, but this common philosophy is at the heart of the tidyverse. It also means we will often prefer longer datasets to wider datasets and {tidyr} will help us move between the two.

Pivot Longer

The most common identifying feature of messy data is when column names are really values.

crm_data |> 
  select(region, customer_id, contains("2018"))
## # A tibble: 10,531 × 14
##    region    customer_id jan_2018 feb_2018 mar_2018 apr_2018 may_2018 jun_2018
##    <chr>           <dbl>    <dbl>    <dbl>    <dbl>    <dbl>    <dbl>    <dbl>
##  1 South            1001        1        0        0        0        0        4
##  2 West             1002        0        0        0        0        0        0
##  3 South            1003        4        0        0        0        0        0
##  4 Midwest          1004        0        0        0        2        0        0
##  5 West             1005        0        1        0        0        0        0
##  6 Midwest          1006        0        0        0        0        0        0
##  7 Midwest          1007        0        0        0        4        0        0
##  8 South            1008        0        0        0        0        0        0
##  9 West             1009        0        0        0        0        1        0
## 10 Northeast        1010        0        0        0        0        0        0
## # ℹ 10,521 more rows
## # ℹ 6 more variables: jul_2018 <dbl>, aug_2018 <dbl>, sep_2018 <dbl>,
## #   oct_2018 <dbl>, nov_2018 <dbl>, dec_2018 <dbl>

When column names are really values, the data frame ends up being wider than it should be. Use pivot_longer() to pivot the data frame longer by turning column names into values.

crm_long <- crm_data |>
  select(region, customer_id, contains("2018")) |> 
  pivot_longer(
    -c(region, customer_id),
    names_to = "month_year",
    values_to = "transactions"
  )

Note how much longer the data frame is and why.

crm_long
## # A tibble: 126,372 × 4
##    region customer_id month_year transactions
##    <chr>        <dbl> <chr>             <dbl>
##  1 South         1001 jan_2018              1
##  2 South         1001 feb_2018              0
##  3 South         1001 mar_2018              0
##  4 South         1001 apr_2018              0
##  5 South         1001 may_2018              0
##  6 South         1001 jun_2018              4
##  7 South         1001 jul_2018              0
##  8 South         1001 aug_2018              0
##  9 South         1001 sep_2018              0
## 10 South         1001 oct_2018              0
## # ℹ 126,362 more rows

Now summarizing the transactions for 2018 by region is trivial.

crm_long |> 
  group_by(region) |> 
  summarize(
    total_transactions = sum(transactions),
    avg_transactions = mean(transactions)
  )
## # A tibble: 4 × 3
##   region    total_transactions avg_transactions
##   <chr>                  <dbl>            <dbl>
## 1 Midwest                10092            0.764
## 2 Northeast              29958            0.774
## 3 South                  10501            0.788
## 4 West                   47375            0.775

Pivot Wider

If the data has the opposite problem and has values that should really be column names, use pivot_wider() to pivot the data frame wider by turning values into column names.

crm_long |> 
  pivot_wider(
    names_from = month_year,
    values_from = transactions
  )
## # A tibble: 10,531 × 14
##    region    customer_id jan_2018 feb_2018 mar_2018 apr_2018 may_2018 jun_2018
##    <chr>           <dbl>    <dbl>    <dbl>    <dbl>    <dbl>    <dbl>    <dbl>
##  1 South            1001        1        0        0        0        0        4
##  2 West             1002        0        0        0        0        0        0
##  3 South            1003        4        0        0        0        0        0
##  4 Midwest          1004        0        0        0        2        0        0
##  5 West             1005        0        1        0        0        0        0
##  6 Midwest          1006        0        0        0        0        0        0
##  7 Midwest          1007        0        0        0        4        0        0
##  8 South            1008        0        0        0        0        0        0
##  9 West             1009        0        0        0        0        1        0
## 10 Northeast        1010        0        0        0        0        0        0
## # ℹ 10,521 more rows
## # ℹ 6 more variables: jul_2018 <dbl>, aug_2018 <dbl>, sep_2018 <dbl>,
## #   oct_2018 <dbl>, nov_2018 <dbl>, dec_2018 <dbl>

Separate Columns

If two (or more) values are in one column, separate() the values into two (or more) columns.

crm_long
## # A tibble: 126,372 × 4
##    region customer_id month_year transactions
##    <chr>        <dbl> <chr>             <dbl>
##  1 South         1001 jan_2018              1
##  2 South         1001 feb_2018              0
##  3 South         1001 mar_2018              0
##  4 South         1001 apr_2018              0
##  5 South         1001 may_2018              0
##  6 South         1001 jun_2018              4
##  7 South         1001 jul_2018              0
##  8 South         1001 aug_2018              0
##  9 South         1001 sep_2018              0
## 10 South         1001 oct_2018              0
## # ℹ 126,362 more rows

crm_long <- crm_long |>
  separate(month_year, c("month", "year"), sep = "_")

crm_long
## # A tibble: 126,372 × 5
##    region customer_id month year  transactions
##    <chr>        <dbl> <chr> <chr>        <dbl>
##  1 South         1001 jan   2018             1
##  2 South         1001 feb   2018             0
##  3 South         1001 mar   2018             0
##  4 South         1001 apr   2018             0
##  5 South         1001 may   2018             0
##  6 South         1001 jun   2018             4
##  7 South         1001 jul   2018             0
##  8 South         1001 aug   2018             0
##  9 South         1001 sep   2018             0
## 10 South         1001 oct   2018             0
## # ℹ 126,362 more rows

Now we can summarize the transactions for 2018 by month and region.

crm_long |> 
  group_by(month, region) |> 
  summarize(
    total_transactions = sum(transactions),
    avg_transactions = mean(transactions)
  ) |> 
  arrange(desc(avg_transactions))
## `summarise()` has grouped output by 'month'. You can override using the
## `.groups` argument.
## # A tibble: 48 × 4
## # Groups:   month [12]
##    month region    total_transactions avg_transactions
##    <chr> <chr>                  <dbl>            <dbl>
##  1 nov   South                   1812            1.63 
##  2 dec   Midwest                 1723            1.56 
##  3 dec   West                    7922            1.55 
##  4 dec   Northeast               4928            1.53 
##  5 dec   South                   1693            1.52 
##  6 nov   West                    7656            1.50 
##  7 nov   Northeast               4824            1.50 
##  8 nov   Midwest                 1646            1.50 
##  9 apr   Midwest                  751            0.682
## 10 may   South                    749            0.674
## # ℹ 38 more rows

Unite Columns

When two (or more) values should be in one column, unite() the values into one column.

crm_long |>
  unite("month_year", c(month, year), sep = "_")
## # A tibble: 126,372 × 4
##    region customer_id month_year transactions
##    <chr>        <dbl> <chr>             <dbl>
##  1 South         1001 jan_2018              1
##  2 South         1001 feb_2018              0
##  3 South         1001 mar_2018              0
##  4 South         1001 apr_2018              0
##  5 South         1001 may_2018              0
##  6 South         1001 jun_2018              4
##  7 South         1001 jul_2018              0
##  8 South         1001 aug_2018              0
##  9 South         1001 sep_2018              0
## 10 South         1001 oct_2018              0
## # ℹ 126,362 more rows

Of all the {tidyr} functions, unite() might seem the least useful. However, we really would like month_year to be an actual date. If we can add a day, we can use {lubridate} to create a date.

crm_long <- crm_long |>
  mutate(day = 1) |> 
  unite("date", c(day, month, year), sep = "-") |> 
  mutate(date = dmy(date))

crm_long
## # A tibble: 126,372 × 4
##    region customer_id date       transactions
##    <chr>        <dbl> <date>            <dbl>
##  1 South         1001 2018-01-01            1
##  2 South         1001 2018-02-01            0
##  3 South         1001 2018-03-01            0
##  4 South         1001 2018-04-01            0
##  5 South         1001 2018-05-01            0
##  6 South         1001 2018-06-01            4
##  7 South         1001 2018-07-01            0
##  8 South         1001 2018-08-01            0
##  9 South         1001 2018-09-01            0
## 10 South         1001 2018-10-01            0
## # ℹ 126,362 more rows

With a long data frame and a date column, we can plot a time series of transactions.

crm_long |> 
  group_by(date, region) |> 
  summarize(
    total_transactions = sum(transactions),
    avg_transactions = mean(transactions)
  ) |> 
  ggplot(aes(x = date, y = avg_transactions, color = region)) +
  geom_line() +
  scale_x_date(date_breaks = "month", date_labels = "%b")

## `summarise()` has grouped output by 'date'. You can override using the
## `.groups` argument.

Data Classes and Types

We’ve been using data frames (technically tibbles, a modern data frame). A data frame is composed of columns called vectors. Both data frames and vectors are classes of data.

Each vector has a single data type. We’ve discussed double (i.e., numeric), integer, date, character, and factor. If we try to mix data types in a vector, it will pick the easiest one to satisfy.

vector_example <- c(1, 2, "three")

vector_example
## [1] "1"     "2"     "three"

Data frames are nice to work with because each vector can be of a different data type.

tibble(id = 1:3, state = "AZ")
## # A tibble: 3 × 2
##      id state
##   <int> <chr>
## 1     1 AZ   
## 2     2 AZ   
## 3     3 AZ

Coercion

Sometimes we need to coerce a data class or type.

as_tibble(vector_example)
## # A tibble: 3 × 1
##   value
##   <chr>
## 1 1    
## 2 2    
## 3 three

Why would we want to coerce a data class?

We can similarly coerce data types with as.*() functions (e.g., as.numeric() and as.character()).

Coercing dates can be tricky.

We often want to coerce factors using the fct_*() functions.

Why would we want to coerce a data type, especially factors?

crm_data |> 
  mutate(region = fct_infreq(region)) |> 
  ggplot(aes(x = region)) +
  geom_bar()

Note that geom_bar() is a wrapper for both count() and geom_col() (i.e., only a single variable is needed and the count is performed as part of the plot).

We’ve already used fct_reorder() to coerce a factor ordered by another variable. Now we’ve used fct_infreq() to coerce a factor ordered by frequency and, if you like, fct_rev() to reverse that order.

crm_data |> 
  mutate(region = region |> fct_infreq() |> fct_rev()) |> 
  ggplot(aes(x = region)) +
  geom_bar()

Wrapping Up

Summary

  • Discussed the philosophy of tidy data.
  • Practiced {tidyr} functions for tidying data.
  • Considered data classes, types, and coercion.

Next Time

  • Querying databases.

Supplementary Material

  • R for Data Science (2e) Chapters 6 and 30

Artwork by @allison_horst

Exercise 5

Customers are often analyzed based on those who have made recent purchases, frequent purchases, and spent the most. This is known as a recency, frequency, monetary (RFM) analysis. Now that we can tidy data, we can analyze customers based on recent purchases and frequent purchases. As you work, remember to sketch out what you’d like the data to look like. In RStudio on Posit Cloud, create a new Quarto document and do the following.

  1. Let’s define customers who have made recent purchases as anyone who has made more than 10 transactions in 2018. Report on the composition of these customers.
  2. Let’s define customers who have made frequent purchases as anyone who has made more than 150 transactions total. Report on the composition of these customers.
  3. Render the Quarto document into Word, export the Word document, and upload to Canvas.